<!DOCTYPE html>
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time</title>

        
    
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">

    
    <meta name="description" content="Discussions:
Hacker News (98 points, 19 comments), Reddit r/MachineLearning (164 points, 20 comments)


Translations: Chinese (Simplified), Persian

The year 2018 has been an inflection point for machine learning models handling text (or more accurately, Natural Language Processing or NLP for short). Our conceptual understanding of how best to represent words and sentences in a way that best captures underlying meanings and relationships is rapidly evolving. Moreover, the NLP community has been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines (It’s been referred to as NLP’s ImageNet moment, referencing how years ago similar developments accelerated the development of machine learning in Computer Vision tasks).


  



">
    <meta property="og:description" content="Discussions:
Hacker News (98 points, 19 comments), Reddit r/MachineLearning (164 points, 20 comments)


Translations: Chinese (Simplified), Persian

The year 2018 has been an inflection point for machine learning models handling text (or more accurately, Natural Language Processing or NLP for short). Our conceptual understanding of how best to represent words and sentences in a way that best captures underlying meanings and relationships is rapidly evolving. Moreover, the NLP community has been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines (It’s been referred to as NLP’s ImageNet moment, referencing how years ago similar developments accelerated the development of machine learning in Computer Vision tasks).


  



">
    
    <meta name="author" content="Jay Alammar">

    
    <meta property="og:title" content="The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)">
    <meta property="twitter:title" content="The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)">
    

    <!--[if lt IE 9]>
      <script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->

    <script async="" src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/analytics.js"></script><script src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/jquery-3.1.1.slim.min.js"></script>
    <script type="text/javascript" src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/d3.min.js"></script>
    <script type="text/javascript" src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/d3-selection-multi.v0.4.min.js"></script>
    <script type="text/javascript" src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/d3-jetpack.js"></script>

    <link rel="stylesheet" href="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bootstrap.min.css">
    <link rel="stylesheet" href="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bootstrap-theme.min.css">
    <script src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bootstrap.min.js"> </script>

    <link rel="stylesheet" type="text/css" href="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/gifplayer.css">
    <script type="text/javascript" src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/jquery.gifplayer.js"></script>

    <!--
    <script data-main="scripts/main" src="scripts/require.js"></script>
    -->
    <link rel="stylesheet" href="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/katex.min.css" integrity="sha384-wE+lCONuEo/QSfLb4AfrSk7HjWJtc4Xc1OiB2/aDBzHzjnlBP4SX7vjErTcwlA8C" crossorigin="anonymous">
    <script src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/katex.min.js" integrity="sha384-tdtuPw3yx/rnUGmnLNWXtfjb9fpmwexsd+lr6HUYnUY4B7JhB5Ty7a1mYd+kto/s" crossorigin="anonymous"></script>

    <link rel="stylesheet" type="text/css" href="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/style.css">
    <link rel="alternate" type="application/rss+xml" title="Jay Alammar - Visualizing machine learning one concept at a time" href="http://jalammar.github.io/feed.xml">

    <meta name="viewport" content="width=device-width">
    <!-- Created with Jekyll Now - http://github.com/barryclark/jekyll-now -->

  <style type="text/css">#mc_embed_signup input.mce_inline_error { border-color:#6B0505; } #mc_embed_signup div.mce_inline_error { margin: 0 0 1em 0; padding: 5px 10px; background-color:#6B0505; font-weight: bold; z-index: 1; color:#fff; }</style></head>

  <body style="zoom: 1;">
    <div class="wrapper-masthead">
      <div class="container">
        <header class="masthead clearfix">
          <a href="http://jalammar.github.io/" class="site-avatar"><img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/1007956"></a>

          <div class="site-info">
            <h1 class="site-name"><a href="http://jalammar.github.io/">Jay Alammar</a></h1>
            <p class="site-description">Visualizing machine learning one concept at a time</p>
          </div>

          <nav>
            <a href="http://jalammar.github.io/">Blog</a>
            <a href="http://jalammar.github.io/about">About</a>
          </nav>
        </header>
      </div>
    </div>

    <div id="main" role="main" class="container">
      <article class="post">
  <h1>The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)</h1>

  <div class="entry prediction">
    <p><span class="discussion">Discussions:
<a href="https://news.ycombinator.com/item?id=18751469" class="hn-link">Hacker News (98 points, 19 comments)</a>, <a href="https://www.reddit.com/r/MachineLearning/comments/a3ykzf/r_the_illustrated_bert_and_elmo_how_nlp_cracked/" class="">Reddit r/MachineLearning (164 points, 20 comments)</a>
</span>
<br>
<span class="discussion">Translations: <a href="https://blog.csdn.net/qq_41664845/article/details/84787969">Chinese (Simplified)</a>, <a href="http://blog.class.vision/1397/09/bert-in-nlp/">Persian</a></span></p>

<p>The year 2018 has been an inflection point for machine learning models handling text (or more accurately, Natural Language Processing or NLP for short). Our conceptual understanding of how best to represent words and sentences in a way that best captures underlying meanings and relationships is rapidly evolving. Moreover, the NLP community has been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines <span class="faded_text">(It’s been referred to as <a href="http://ruder.io/nlp-imagenet/">NLP’s ImageNet moment</a>, referencing how years ago similar developments accelerated the development of machine learning in Computer Vision tasks)</span>.</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/transformer-ber-ulmfit-elmo.png">

</div>

<!--more-->

<p><span class="faded_text">(ULM-FiT has nothing to do with Cookie Monster. But I couldn’t think of anything else..)</span></p>

<p>One of the latest milestones in this development is the <a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">release</a> of <a href="https://github.com/google-research/bert">BERT</a>, an event <a href="https://twitter.com/lmthang/status/1050543868041555969">described</a> as marking the beginning of a new era in NLP. BERT is a model that broke several records for how well models can handle language-based tasks. Soon after the release of the paper describing the model, the team also open-sourced the code of the model, and made available for download versions of the model that were already pre-trained on massive datasets. This is a momentous development since it enables anyone building a machine learning model involving language processing to use this powerhouse as a readily-available component – saving the time, energy, knowledge, and resources that would have gone to training a language-processing model from scratch.</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-transfer-learning.png">
  <br>
  The two steps of how BERT is developed. You can download the model pre-trained in step 1 (trained on un-annotated data), and only worry about fine-tuning it for step 2. [<a href="https://commons.wikimedia.org/wiki/File:Documents_icon_-_noun_project_5020.svg">Source</a> for book icon].
</div>

<p>BERT builds on top of a number of clever ideas that have been bubbling up in the NLP community recently – including but not limited to <a href="https://arxiv.org/abs/1511.01432">Semi-supervised Sequence Learning</a><span class="faded_text"> (by <a href="https://twitter.com/iamandrewdai">Andrew Dai</a> and <a href="https://twitter.com/quocleix">Quoc Le</a>)</span>, <a href="https://arxiv.org/abs/1802.05365">ELMo</a> <span class="faded_text">(by <a href="https://twitter.com/mattthemathman">Matthew Peters</a> and researchers from <a href="https://allenai.org/">AI2</a> and <a href="https://www.engr.washington.edu/about/bldgs/cse">UW CSE</a>)</span>, <a href="https://arxiv.org/abs/1801.06146">ULMFiT</a> <span class="faded_text">(by fast.ai founder <a href="https://twitter.com/jeremyphoward">Jeremy Howard</a> and <a href="https://twitter.com/seb_ruder">Sebastian Ruder</a>)</span>, the <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">OpenAI transformer</a> <span class="faded_text">(by OpenAI researchers <a href="https://twitter.com/alecrad">Radford</a>, <a href="https://twitter.com/karthik_r_n">Narasimhan</a>, <a href="https://twitter.com/timsalimans">Salimans</a>, and <a href="https://twitter.com/ilyasut">Sutskever</a>)</span>, and the Transformer <span class="faded_text">(<a href="https://arxiv.org/pdf/1706.03762.pdf">Vaswani et al</a>)</span>.</p>

<p>There are a number of concepts one needs to be aware of to properly wrap one’s head around what BERT is. So let’s start by looking at ways you can use BERT before looking at the concepts involved in the model itself.</p>

<h2 id="example-sentence-classification">Example: Sentence Classification</h2>
<p>The most straightforward way to use BERT is to classify a single piece of text. This model would look like this:</p>

<p><img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/BERT-classification-spam.png"></p>

<p>To train such a model, you mainly have to train the classifier, with minimal changes happening to the BERT model during the training phase. This training process is called Fine-Tuning, and has roots in <a href="https://arxiv.org/abs/1511.01432">Semi-supervised Sequence Learning</a> and ULMFiT.</p>

<p>For readers new to the topic: since we’re talking about classifiers, we are in the supervised-learning domain of machine learning, which means we need a labeled dataset to train such a model. For this spam classifier example, the labeled dataset would be a list of email messages, each with a label (“spam” or “not spam”).</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/spam-labeled-dataset.png">
</div>

<p>Other examples for such a use-case include:</p>

<ul>
  <li><strong>Sentiment analysis</strong>
    <ul>
      <li>Input: Movie/Product review. Output: is the review positive or negative?</li>
      <li>Example dataset: <a href="https://nlp.stanford.edu/sentiment/">SST</a></li>
    </ul>
  </li>
  <li><strong>Fact-checking</strong>
    <ul>
      <li>Input: sentence. Output: “Claim” or “Not Claim”</li>
      <li>More ambitious/futuristic example:
        <ul>
          <li>Input: Claim sentence. Output: “True” or “False”</li>
        </ul>
      </li>
      <li><a href="https://fullfact.org/">Full Fact</a> is an organization building automatic fact-checking tools for the benefit of the public. Part of their pipeline is a classifier that reads news articles and detects claims (classifies text as either “claim” or “not claim”) which can later be fact-checked (by humans now, and hopefully with ML later).</li>
      <li>Video: <a href="https://www.youtube.com/watch?v=ddf0lgPCoSo">Sentence embeddings for automated factchecking - Lev Konstantinovskiy</a>.</li>
    </ul>
  </li>
</ul>

<h2 id="model-architecture">Model Architecture</h2>

<p>Now that you have an example use-case in your head for how BERT can be used, let’s take a closer look at how it works.</p>

<p><img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-base-bert-large.png"></p>

<p>The paper presents two model sizes for BERT:</p>

<ul>
  <li>BERT BASE – Comparable in size to the OpenAI Transformer in order to compare performance</li>
  <li>BERT LARGE – A ridiculously huge model which achieved the state-of-the-art results reported in the paper</li>
</ul>

<p>BERT is basically a trained Transformer Encoder stack. This is a good time to direct you to read my earlier post <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a> which explains the Transformer model – a foundational concept for BERT and the concepts we’ll discuss next.</p>

<p><img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-base-bert-large-encoders.png"></p>

<p>Both BERT model sizes have a large number of encoder layers (which the paper calls Transformer Blocks) – twelve for the Base version, and twenty-four for the Large version. They also have larger hidden sizes (768 and 1024 units respectively) and more attention heads (12 and 16 respectively) than the default configuration in the reference implementation of the Transformer in the initial paper (6 encoder layers, 512 hidden units, and 8 attention heads).</p>

<h3 id="model-inputs">Model Inputs</h3>

<p><img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-input-output.png"></p>

<p>The first input position is supplied with a special [CLS] token for reasons that will become apparent later on. CLS here stands for Classification.</p>

<p>Just like the vanilla encoder of the Transformer, BERT takes a sequence of words as input, which keeps flowing up the stack. Each layer applies self-attention, passes its results through a feed-forward network, and then hands them off to the next encoder.</p>

<p><img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-encoders-input.png"></p>

<p>In terms of architecture, this has been identical to the Transformer up until this point (aside from size, which is just a matter of configuration). It is at the output that we first start seeing how things diverge.</p>

<h3 id="model-outputs">Model Outputs</h3>

<p>Each position outputs a vector of size <em>hidden_size</em> (768 in BERT Base). For the sentence classification example we’ve looked at above, we focus on the output of only the first position (that we passed the special [CLS] token to).</p>

<p><img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-output-vector.png"></p>

<p>That vector can now be used as the input for a classifier of our choosing. The paper achieves great results by just using a single-layer neural network as the classifier.</p>

<p><img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-classifier.png"></p>

<p>If you have more labels (for example if you’re an email service that tags emails with “spam”, “not spam”, “social”, and “promotion”), you just tweak the classifier network to have more output neurons that then pass through softmax.</p>
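<p>The hand-off from the [CLS] output vector to a classifier can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the vector and the classifier weights below are random stand-ins (a real setup would take the vector from BERT and learn the weights during fine-tuning), and the names <em>classify</em> and <em>softmax</em> are chosen here for illustration.</p>

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, weights, biases):
    """Single-layer classifier: one linear layer followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)

HIDDEN_SIZE = 768  # size of each output vector in BERT Base
LABELS = ["spam", "not spam", "social", "promotion"]

rng = random.Random(0)
# Stand-in for the vector BERT outputs at the [CLS] position (random here).
cls_vector = [rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
# Untrained, randomly initialised parameters: one row of weights per label.
weights = [[rng.uniform(-0.05, 0.05) for _ in range(HIDDEN_SIZE)] for _ in LABELS]
biases = [0.0 for _ in LABELS]

probs = classify(cls_vector, weights, biases)
predicted = LABELS[probs.index(max(probs))]
```

<p>Adding a label only means adding one more row of weights and one more bias; the softmax then spreads the probability over the larger set of labels.</p>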

<h2 id="parallels-with-convolutional-nets">Parallels with Convolutional Nets</h2>

<p>For those with a background in computer vision, this vector hand-off should be reminiscent of what happens between the convolution part of a network like VGGNet and the fully-connected classification portion at the end of the network.</p>

<p><img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/vgg-net-classifier.png"></p>

<h2 id="a-new-age-of-embedding">A New Age of Embedding</h2>

<p>These new developments carry with them a new shift in how words are encoded. Up until now, word-embeddings have been a major force in how leading NLP models deal with language. Methods like Word2Vec and GloVe have been widely used for such tasks. Let’s recap how those are used before pointing to what has now changed.</p>

<h3 id="word-embedding-recap">Word Embedding Recap</h3>

<p>For words to be processed by machine learning models, they need some form of numeric representation that models can use in their calculation. Word2Vec showed that we can use a vector (a list of numbers) to properly represent words in a way that captures <em>semantic</em> or meaning-related relationships (e.g. the ability to tell if words are similar, or opposites, or that a pair of words like “Stockholm” and “Sweden” have the same relationship between them as “Cairo” and “Egypt” have between them) as well as syntactic, or grammar-based, relationships (e.g. the relationship between “had” and “has” is the same as that between “was” and “is”).</p>
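<p>To make the “same relationship” idea concrete, here is a toy sketch of the vector arithmetic behind such analogies. The three-dimensional vectors are made up for illustration (real Word2Vec or GloVe vectors have hundreds of learned dimensions), and the helper names <em>cosine</em> and <em>analogy</em> are hypothetical.</p>

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy embeddings, NOT real Word2Vec/GloVe values.
# Dimensions roughly mean: [is-a-capital, Sweden-related, Egypt-related].
embeddings = {
    "Stockholm": [1.0, 1.0, 0.0],
    "Sweden":    [0.0, 1.0, 0.0],
    "Cairo":     [1.0, 0.0, 1.0],
    "Egypt":     [0.0, 0.0, 1.0],
    "stick":     [0.2, 0.1, 0.1],
}

def analogy(a, b, c):
    """Return the word d for which 'a is to b as c is to d'."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the query words themselves, as is common for analogy tests.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

answer = analogy("Stockholm", "Sweden", "Cairo")
print(answer)
```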

<p>The field quickly realized it’s a great idea to use embeddings that were pre-trained on vast amounts of text data instead of training them alongside the model on what was frequently a small dataset. So it became possible to download a list of words and their embeddings generated by pre-training with Word2Vec or GloVe. This is an example of the GloVe embedding of the word “stick” (with an embedding vector size of 200):</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/glove-embedding.png">
  <br>
  The GloVe word embedding of the word "stick" - a vector of 200 floats (rounded to two decimals). It goes on for two hundred values.
</div>

<p>Since these are large and full of numbers, I use the following basic shape in the figures in my posts to show vectors:</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/vector-boxes.png">
  <br>
</div>

<h3 id="elmo-context-matters">ELMo: Context Matters</h3>

<p>If we’re using this GloVe representation, then the word “stick” would be represented by this vector no matter what the context was. “Wait a minute” said a number of NLP researchers <span class="faded_text">(<a href="https://arxiv.org/abs/1705.00108">Peters et al., 2017</a>, <a href="https://arxiv.org/abs/1708.00107">McCann et al., 2017</a>, and yet again <a href="https://arxiv.org/pdf/1802.05365.pdf">Peters et al., 2018 in the ELMo paper</a>)</span>, “<em>stick</em> has multiple meanings depending on where it’s used. Why not give it an embedding based on the context it’s used in – to both capture the word meaning in that context as well as other contextual information?” And so, <em>contextualized</em> word-embeddings were born.</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/elmo-embedding-robin-williams.png">
  <br>
  Contextualized word-embeddings can give words different embeddings based on the meaning they carry in the context of the sentence. Also, <a href="https://www.youtube.com/watch?v=OwwdgsN9wF8">RIP Robin Williams</a>
</div>

<p>Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding. It uses a bi-directional LSTM trained on a specific task to be able to create those embeddings.</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/elmo-word-embedding.png">
  <br>

</div>

<p>ELMo provided a significant step towards pre-training in the context of NLP. The ELMo LSTM would be trained on a massive dataset in the language of our dataset, and then we can use it as a component in other models that need to handle language.</p>

<p>What’s ELMo’s secret?</p>

<p>ELMo gained its language understanding from being trained to predict the next word in a sequence of words - a task called <em>Language Modeling</em>. This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.</p>

<div class="img-div-any-width">
  <p><img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/Bert-language-modeling.png">
  <br>
  A step in the pre-training process of ELMo: Given “Let’s stick to” as input, predict the next most likely word – a <em>language modeling</em> task. When trained on a large dataset, the model starts to pick up on language patterns. It’s unlikely it’ll accurately guess the next word in this example. More realistically, after a word such as “hang”, it will assign a higher probability to a word like “out” (to spell “hang out”) than to “camera”.</p>
</div>

<p>We can see the hidden state of each unrolled-LSTM step peeking out from behind ELMo’s head. Those come in handy in the embedding process after this pre-training is done.</p>

<p>ELMo actually goes a step further and trains a bi-directional LSTM – so that its language model doesn’t only have a sense of the next word, but also the previous word.</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/elmo-forward-backward-language-model-embedding.png">
  <br>
  <a href="https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018">Great slides</a> on ELMo
</div>

<p>ELMo comes up with the contextualized embedding through grouping together the hidden states (and initial embedding) in a certain way (concatenation followed by weighted summation).</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/elmo-embedding.png">
</div>
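<p>The “concatenation followed by weighted summation” step can be sketched as follows for a single word. The numbers are toy values, the layer weights are hypothetical (ELMo learns them for each downstream task), and duplicating the initial embedding so that it matches the length of the concatenated states is an assumption made here to keep the example simple.</p>

```python
def concat(forward, backward):
    """Join the forward and backward vectors of one layer end to end."""
    return forward + backward

def weighted_sum(layers, weights):
    """Collapse several same-sized layer vectors into one vector."""
    size = len(layers[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, layers))
        for i in range(size)
    ]

# Toy values for ONE word; the hidden size is 2 to keep things readable.
token_embedding = [0.1, 0.2]                 # initial (context-free) embedding
forward_states = [[0.3, 0.4], [0.5, 0.6]]    # forward LSTM, layers 1 and 2
backward_states = [[0.7, 0.8], [0.9, 1.0]]   # backward LSTM, layers 1 and 2

# Step 1: concatenate per layer (the embedding is duplicated to match size).
layers = [concat(token_embedding, token_embedding)]
layers += [concat(f, b) for f, b in zip(forward_states, backward_states)]

# Step 2: weighted summation across layers (hypothetical weights).
layer_weights = [0.2, 0.3, 0.5]
elmo_vector = weighted_sum(layers, layer_weights)
```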

<h2 id="ulm-fit-nailing-down-transfer-learning-in-nlp">ULM-FiT: Nailing down Transfer Learning in NLP</h2>
<p>ULM-FiT introduced methods to effectively utilize a lot of what the model learns during pre-training – more than just embeddings, and more than contextualized embeddings. ULM-FiT introduced a language model and a process to effectively fine-tune that language model for various tasks.</p>

<p>NLP finally had a way to do transfer learning probably as well as Computer Vision could.</p>

<h2 id="the-transformer-going-beyond-lstms">The Transformer: Going beyond LSTMs</h2>
<p>The release of the Transformer paper and code, and the results it achieved on tasks such as machine translation, started to make some in the field think of Transformers as a replacement for LSTMs. This was compounded by the fact that Transformers deal with long-term dependencies better than LSTMs.</p>

<p>The Encoder-Decoder structure of the Transformer made it perfect for machine translation. But how would you use it for sentence classification? How would you use it to pre-train a language model that can be fine-tuned for other tasks? (The field calls these <em>downstream</em> tasks: supervised-learning tasks that utilize a pre-trained model or component.)</p>

<h2 id="openai-transformer-pre-training-a-transformer-decoder-for-language-modeling">OpenAI Transformer: Pre-training a Transformer Decoder for Language Modeling</h2>
<p>It turns out we don’t need an entire Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks. We can do with just the decoder of the Transformer. The decoder is a natural choice for language modeling (predicting the next word) since it’s built to mask future tokens – a valuable feature when it’s generating a translation word by word.</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/openai-transformer-1.png">
  <br>
  The OpenAI Transformer is made up of the decoder stack from the Transformer
</div>

<p>The model stacked twelve decoder layers. Since there is no encoder in this set up, these decoder layers would not have the encoder-decoder attention sublayer that vanilla Transformer decoder layers have. They would still have the self-attention layer, however (masked so it doesn’t peek at future tokens).</p>

<p>With this structure, we can proceed to train the model on the same language modeling task: predict the next word using massive (unlabeled) datasets. Just throw the text of 7,000 books at it and have it learn! Books are great for this sort of task since they allow the model to learn to associate related information even when it is separated by a lot of text – something you don’t get, for example, when you’re training with tweets or articles.</p>
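<p>Masking in self-attention can be sketched like this: before the softmax, every position is only allowed to keep the scores of itself and earlier positions. This is a simplified illustration with uniform toy scores, not the OpenAI implementation; the function names are chosen here for illustration.</p>

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Softmax over each row, giving masked-out (future) positions zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]
        m = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform toy attention scores for a 3-token sequence.
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))

for row in weights:
    print([round(w, 2) for w in row])
```

<p>The first token can only attend to itself, the second splits its attention over the first two tokens, and only the last token sees the whole sequence.</p>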

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/openai-transformer-language-modeling.png">
  <br>
  The OpenAI Transformer is now ready to be trained to predict the next word on a dataset made up of 7,000 books.
</div>

<h2 id="transfer-learning-to-downstream-tasks">Transfer Learning to Downstream Tasks</h2>

<p>Now that the OpenAI transformer is pre-trained and its layers have been tuned to reasonably handle language, we can start using it for downstream tasks. Let’s first look at sentence classification (classify an email message as “spam” or “not spam”):</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/openai-transformer-sentence-classification.png">
  <br>

  How to use a pre-trained OpenAI transformer to do sentence classification
</div>

<p>The OpenAI paper outlines a number of input transformations to handle the inputs for different types of tasks. The following image from the paper shows the structures of the models and input transformations to carry out different tasks.</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/openai-input transformations.png">
  <br>
</div>

<p>Isn’t that clever?</p>

<h2 id="bert-from-decoders-to-encoders">BERT: From Decoders to Encoders</h2>
<p>The OpenAI transformer gave us a fine-tunable pre-trained model based on the Transformer. But something went missing in this transition from LSTMs to Transformers. ELMo’s language model was bi-directional, but the OpenAI transformer only trains a forward language model. Could we build a transformer-based model whose language model looks both forward and backwards (in the technical jargon – “is conditioned on both left and right context”)?</p>

<p>“Hold my beer”, said R-rated BERT.</p>

<h3 id="masked-language-model">Masked Language Model</h3>

<p>“We’ll use transformer encoders”, said BERT.</p>

<p>“This is madness”, replied Ernie, “Everybody knows bidirectional conditioning would allow each word to indirectly see itself in a multi-layered context.”</p>

<p>“We’ll use masks”, said BERT confidently.</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/BERT-language-modeling-masked-lm.png">
  <br>
  BERT's clever language modeling task masks 15% of words in the input and asks the model to predict the missing word.
</div>

<p>Finding the right task to train a Transformer stack of encoders is a complex hurdle that BERT resolves by adopting a “masked language model” concept from earlier literature (where it’s called a Cloze task).</p>

<p>Beyond masking 15% of the input, BERT also mixes things up a bit in order to improve how the model later fine-tunes. Sometimes it randomly replaces a word with another word and asks the model to predict the correct word in that position.</p>
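<p>A rough sketch of this corruption step is shown below. The BERT paper describes replacing a selected token with [MASK] 80% of the time, with a random token 10% of the time, and leaving it unchanged 10% of the time. The function name, the tiny vocabulary, and selecting each position independently with probability 0.15 are simplifying assumptions made here; this is not the released pre-training code.</p>

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted inputs, prediction targets) for masked language modeling.

    targets[i] is the original token where a prediction is required,
    and None everywhere else.
    """
    inputs = list(tokens)
    targets = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected: leave it alone
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: hide the word
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: swap in a random word
        # remaining 10%: keep the original word, but still predict it
    return inputs, targets

tokens = "let's stick to improvisation in this skit".split()
vocab = ["camera", "out", "hang", "book"]  # toy vocabulary (hypothetical)
inputs, targets = mask_tokens(tokens, vocab, random.Random(1))
```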

<h3 id="two-sentence-tasks">Two-sentence Tasks</h3>

<p>If you look back up at the input transformations the OpenAI transformer does to handle different tasks, you’ll notice that some tasks require the model to say something intelligent about two sentences (e.g. are they simply paraphrased versions of each other? Given a Wikipedia entry as input, and a question regarding that entry as another input, can we answer that question?).</p>

<p>To make BERT better at handling relationships between multiple sentences, the pre-training process includes an additional task: Given two sentences (A and B), is B likely to be the sentence that follows A, or not?</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-next-sentence-prediction.png">
  <br>
  The second task BERT is pre-trained on is a two-sentence classification task. The tokenization is oversimplified in this graphic: BERT actually uses WordPieces as tokens rather than words, so some words are broken down into smaller chunks.
</div>
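<p>As a rough sketch of how one such training example could be assembled: sentence A is paired with its true successor about half the time and with a random sentence otherwise, and the two are packed into a single sequence using the <code class="highlighter-rouge">[CLS]</code> and <code class="highlighter-rouge">[SEP]</code> markers plus segment ids that tell the model which sentence each token belongs to. The helper <code class="highlighter-rouge">make_pair_example</code> and the toy sentences are assumptions for illustration. The real pipeline draws the random sentence from a different document, which this sketch simplifies.</p>

```python
import random

def make_pair_example(sentences, index, rng):
    """Build one two-sentence example from sentences[index] (illustrative sketch).

    `sentences` is a list of token lists with at least three entries,
    and sentences[index + 1] must exist.
    """
    sentence_a = sentences[index]
    is_next = rng.choice([True, False])  # 50/50 split between real and random B
    if is_next:
        sentence_b = sentences[index + 1]
    else:
        # Simplification: any sentence other than A or its true successor.
        candidates = [s for i, s in enumerate(sentences)
                      if i not in (index, index + 1)]
        sentence_b = rng.choice(candidates)

    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids, is_next

# Hypothetical toy corpus, whitespace-tokenized for simplicity.
sentences = [
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
    "penguins are flightless birds".split(),
]
tokens, segment_ids, is_next = make_pair_example(sentences, 0, random.Random(0))
print(tokens)
print(segment_ids)
print("IsNext" if is_next else "NotNext")
```

<p>The classifier for this task reads the output at the <code class="highlighter-rouge">[CLS]</code> position, which is why that token is placed at the start of every sequence.</p>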

<h3 id="task-specific-models">Task-specific Models</h3>
<p>The BERT paper shows a number of ways to use BERT for different tasks.</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-tasks.png">
  <br>
</div>

<h3 id="bert-for-feature-extraction">BERT for feature extraction</h3>
<p>The fine-tuning approach isn’t the only way to use BERT. Just like ELMo, you can use the pre-trained BERT to create contextualized word embeddings. Then you can feed these embeddings to your existing model – a process the paper shows yields results not far behind fine-tuning BERT on a task such as named-entity recognition.</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-contexualized-embeddings.png">
  <br>
</div>

<p>Which vector works best as a contextualized embedding? I would think it depends on the task. The paper examines six choices on a named-entity recognition task (compared to the fine-tuned model, which achieved an F1 score of 96.4):</p>

<div class="img-div-any-width">
  <img src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/bert-feature-extraction-contextualized-embeddings.png">
  <br>
</div>
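<p>To make these choices concrete, here is a small sketch of how per-token vectors from different layers might be combined. It uses plain Python lists and dummy values instead of a real model, and it assumes a BERT Base-sized stack (an embedding output followed by 12 encoder layers). The helper names are hypothetical, and plain sums stand in for whatever weighting a real setup might use.</p>

```python
def sum_layers(layers):
    """Element-wise sum of each token's vector across the given layers."""
    num_tokens = len(layers[0])
    return [
        [sum(values) for values in zip(*(layer[t] for layer in layers))]
        for t in range(num_tokens)
    ]

def concat_layers(layers):
    """Concatenate each token's vector across the given layers."""
    num_tokens = len(layers[0])
    return [
        [value for layer in layers for value in layer[t]]
        for t in range(num_tokens)
    ]

# Dummy activations: hidden_states[layer][token] is a vector.
# Index 0 stands for the embedding output, 1..12 for the encoder layers.
# Every value in layer k is simply k, so the results are easy to check.
num_layers, num_tokens, hidden_size = 13, 3, 4
hidden_states = [
    [[float(layer)] * hidden_size for _ in range(num_tokens)]
    for layer in range(num_layers)
]

features = {
    "embedding_output": hidden_states[0],
    "last_layer": hidden_states[-1],
    "second_to_last_layer": hidden_states[-2],
    "sum_all_12_layers": sum_layers(hidden_states[1:]),
    "sum_last_four": sum_layers(hidden_states[-4:]),
    "concat_last_four": concat_layers(hidden_states[-4:]),
}
for name, vectors in features.items():
    print(name, "tokens:", len(vectors), "dims:", len(vectors[0]))
```

<p>Note that concatenation multiplies the feature size by the number of layers used, while summing keeps it at the hidden size, which matters for whatever model consumes these embeddings.</p>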

<h2 id="take-bert-out-for-a-spin">Take BERT out for a spin</h2>
<p>The best way to try out BERT is through the <a href="https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb">BERT FineTuning with Cloud TPUs</a> notebook hosted on Google Colab. If you’ve never used Cloud TPUs before, this is also a good starting point to try them. The BERT code works on CPUs and GPUs as well as TPUs.</p>

<p>The next step would be to look at the code in the <a href="https://github.com/google-research/bert">BERT repo</a>:</p>

<ul>
  <li>The model is constructed in <a href="https://github.com/google-research/bert/blob/master/modeling.py">modeling.py</a> (<code class="highlighter-rouge">class BertModel</code>) and is pretty much identical to a vanilla Transformer encoder.</li>
  <li>
    <p><a href="https://github.com/google-research/bert/blob/master/run_classifier.py">run_classifier.py</a> is an example of the fine-tuning process. It also constructs the classification layer for the supervised model. If you want to construct your own classifier, check out the <code class="highlighter-rouge">create_model()</code> function in that file.</p>
  </li>
  <li>
    <p>Several pre-trained models are available for download. These span BERT Base and BERT Large, and include English and Chinese models as well as a multilingual model covering 102 languages trained on Wikipedia.</p>
  </li>
  <li>BERT doesn’t look at words as tokens. Rather, it looks at WordPieces. <a href="https://github.com/google-research/bert/blob/master/tokenization.py">tokenization.py</a> is the tokenizer that turns your words into WordPieces appropriate for BERT.</li>
</ul>
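<p>To give a feel for what WordPiece tokenization does, here is a simplified sketch of greedy longest-match-first splitting of a single word, where continuation pieces carry a <code class="highlighter-rouge">##</code> prefix. The tiny vocabulary is made up for this example. The real tokenizer also performs steps this sketch skips, such as splitting on whitespace and punctuation before looking up pieces.</p>

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPieces by greedy longest-match-first lookup.

    Simplified sketch: at each step, take the longest piece found in the
    vocabulary that starts at the current position. Pieces that do not
    start the word are looked up with a "##" prefix.
    """
    pieces = []
    start = 0
    while start != len(word):
        match = None
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            candidate = word[start:end]
            if start != 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = (candidate, end)
                break
        if match is None:
            # No piece fits here, so the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match[0])
        start = match[1]
    return pieces

# Hypothetical toy vocabulary, far smaller than a real one.
toy_vocab = {"un", "##aff", "##able", "play", "##ing", "##ed"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", toy_vocab))    # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

<p>Because rare words break down into pieces that are in the vocabulary, the model rarely has to fall back to the unknown token.</p>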

<p>You can also check out the <a href="https://github.com/huggingface/pytorch-pretrained-BERT">PyTorch implementation of BERT</a>. The <a href="https://github.com/allenai/allennlp">AllenNLP</a> library uses this implementation to <a href="https://github.com/allenai/allennlp/pull/2067">allow using BERT embeddings</a> with any model.</p>

<h2 id="acknowledgements">Acknowledgements</h2>

<p>Thanks to <a href="https://github.com/jacobdevlin-google">Jacob Devlin</a>, <a href="https://twitter.com/nlpmattg">Matt Gardner</a>, <a href="https://github.com/kentonl">Kenton Lee</a>,  <a href="https://twitter.com/markneumannnn">Mark Neumann</a>, and <a href="https://twitter.com/mattthemathman">Matthew Peters</a> for providing feedback on earlier drafts of this post.</p>

  </div>

  <div class="date">
    Written on December  3, 2018
  </div>

  
</article>

    </div>




<div style="padding: 10px 0 10px 3%; color: #555; font-size:85%">
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="./The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time_files/88x31.png"></a><br>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<br>
Attribution example:
<br>
<i>Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from <a href="https://jalammar.github.io/illustrated-transformer/">https://jalammar.github.io/illustrated-transformer/</a></i>

<br><br>
Note: If you translate any of the posts, let me know so I can link your translation to the original post. My email is in the <a href="http://jalammar.github.io/about">about page</a>.
</div>


    <div class="wrapper-footer">
      <div class="container">
        <footer class="footer">
          



<a href="https://github.com/jalammar"><i class="svg-icon github"></i></a>

<a href="https://www.linkedin.com/in/jalammar"><i class="svg-icon linkedin"></i></a>


<a href="https://www.twitter.com/jalammar"><i class="svg-icon twitter"></i></a>



        </footer>
      </div>
    </div>

    


  

</body></html>